Abstract

Inspired by recent successes of deep learning in computer vision, we
propose a novel framework for encoding time series as different types of
images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF)
and Markov Transition Fields (MTF). This enables the use of techniques
from computer vision for time series classification and imputation. We
used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard
datasets to learn high-level features from the individual and compound
GASF-GADF-MTF images. Our approaches achieve highly competitive results
when compared to nine of the current best time series classification
approaches. Inspired by the bijection property of GASF on [0, 1] rescaled
data, we train Denoised Auto-encoders (DA) on the GASF images of four
standard and one synthesized compound dataset. The imputation MSE on test
data is reduced by 12.18%-48.02% when compared to using the raw data. An
analysis of the features and weights learned via tiled CNNs and DAs
explains why the approaches work.

1 Introduction 

Since 2006, the techniques developed from deep neural networks (or, deep
learning) have greatly impacted natural language processing, speech
recognition and computer vision research [Bengio, 2009; Deng and Yu,
2014]. One successful deep learning architecture used in computer vision
is the convolutional neural network (CNN) [LeCun et al., 1998]. CNNs
exploit translational invariance by extracting features through receptive
fields [Hubel and Wiesel, 1962] and learning with weight sharing, and have
become the state-of-the-art approach in various image recognition and
computer vision tasks [Krizhevsky et al., 2012]. Since unsupervised
pretraining has been shown to improve performance [Erhan et al., 2010],
sparse coding and Topographic Independent Component Analysis (TICA) are
integrated as unsupervised pretraining approaches to learn more diverse
features with complex invariances [Kavukcuoglu et al., 2010; Ngiam et al.,
2010].

Along with the success of unsupervised pretraining applied in deep
learning, others are studying unsupervised learning algorithms for
generative models, such as Deep Belief Networks (DBN) and Denoised
Auto-encoders (DA) [Hinton et al., 2006; Vincent et al., 2008]. Many deep
generative models are based on energy-based models or auto-encoders.
Temporal autoencoding is integrated with Restricted Boltzmann Machines
(RBMs) to improve generative models [Hausler et al., 2013]. A training
strategy inspired by recent work on optimization-based learning is
proposed to train complex neural networks for imputation tasks [Brakel et
al., 2013]. A generalized Denoised Auto-encoder extends the theoretical
framework and is applied to Deep Generative Stochastic Networks (DGSN)
[Bengio et al., 2013; Bengio and Thibodeau-Laufer, 2013].

Inspired by recent successes of supervised and unsupervised learning
techniques in computer vision, we consider the problem of encoding time
series as images to allow machines to “visually” recognize, classify and
learn structures and patterns. Reformulating features of time series as
visual clues has attracted much attention in computer science and physics.
In speech recognition systems, acoustic/speech data input is typically
represented by concatenating Mel-frequency cepstral coefficients (MFCCs)
or perceptual linear predictive coefficients (PLPs) [Hermansky, 1990].
Recently, researchers have been building different network structures from
time series for visual inspection or for designing distance measures.
Recurrence Networks were proposed to analyze the structural properties of
time series from complex systems [Donner et al., 2010; 2011]. They build
adjacency matrices from predefined recurrence functions to interpret the
time series as complex networks. Silva et al. extended the recurrence plot
paradigm for time series classification using compression distance [Silva
et al., 2013]. Another way to build a weighted adjacency matrix is to
extract transition dynamics from the first order Markov matrix
[Campanharo et al., 2011]. Although these maps demonstrate distinct
topological properties among different time series, it remains unclear how
these topological properties relate to the original time series since they
have no exact inverse operations.

We present three novel representations for encoding time series as images
that we call the Gramian Angular Summation/Difference Field (GASF/GADF)
and the Markov Transition Field (MTF). We applied deep Tiled Convolutional
Neural Networks (Tiled CNN) [Ngiam et al., 2010] to classify time series
images on 20 standard datasets. Our experimental results demonstrate that
our approaches achieve the best performance on 9 of 20 standard datasets
compared with 9 previous and current best classification methods. Inspired
by the bijection property of GASF on [0, 1] rescaled data, we train the
Denoised Auto-encoder (DA) on the GASF images of 4 standard datasets and a
synthesized compound dataset. The imputation MSE on test data is reduced
by 12.18%-48.02% compared to using the raw data. An analysis of the
features and weights learned via tiled CNNs and DA explains why the
approaches work.

Figure 1: Illustration of the proposed encoding map of Gramian Angular
Fields. X is a sequence of rescaled time series in the 'Fish' dataset. We
transform X into a polar coordinate system by eq. (3) and finally
calculate its GASF/GADF images with eqs. (5) and (7). In this example, we
build GAFs without PAA smoothing, so the GAFs both have high resolution.


2 Imaging Time Series 

We first introduce our two frameworks for encoding time series as images.
The first type of image is a Gramian Angular Field (GAF), in which we
represent time series in a polar coordinate system instead of the typical
Cartesian coordinates. In the Gramian matrix, each element is the cosine
of the summation of angles. Inspired by previous work on the duality
between time series and complex networks [Campanharo et al., 2011], the
main idea of the second framework, the Markov Transition Field (MTF), is
to build the Markov matrix of quantile bins after discretization and
encode the dynamic transition probability in a quasi-Gramian matrix.


2.1 Gramian Angular Field 

Given a time series $X = \{x_1, x_2, \ldots, x_n\}$ of $n$ real-valued
observations, we rescale $X$ so that all values fall in the interval
$[-1, 1]$ or $[0, 1]$ by:

$$\tilde{x}_{-1}^{i} = \frac{(x_i - \max(X)) + (x_i - \min(X))}{\max(X) - \min(X)} \tag{1}$$

or

$$\tilde{x}_{0}^{i} = \frac{x_i - \min(X)}{\max(X) - \min(X)} \tag{2}$$

Thus we can represent the rescaled time series $\tilde{X}$ in polar
coordinates by encoding the value as the angular cosine and the time stamp
as the radius with the equation below:

$$\begin{cases} \phi = \arccos(\tilde{x}_i), & -1 \le \tilde{x}_i \le 1,\ \tilde{x}_i \in \tilde{X} \\ r = \dfrac{t_i}{N}, & t_i \in \mathbb{N} \end{cases} \tag{3}$$

In the equation above, $t_i$ is the time stamp and $N$ is a constant
factor to regularize the span of the polar coordinate system. This polar
coordinate based representation is a novel way to understand time series.
As time increases, corresponding values warp among different angular
points on the spanning circles, like water rippling. The encoding map of
equation (3) has two important properties. First, it is bijective as
$\cos(\phi)$ is monotonic when $\phi \in [0, \pi]$. Given a time series,
the proposed map produces one and only one result in the polar coordinate
system with a unique inverse map. Second, as opposed to Cartesian
coordinates, polar coordinates preserve absolute temporal relations. We
will discuss this in more detail in future work.

Rescaled data in different intervals have different angular bounds.
$[0, 1]$ corresponds to the cosine function in $[0, \frac{\pi}{2}]$, while
cosine values in the interval $[-1, 1]$ fall into the angular bounds
$[0, \pi]$. As we will discuss later, they provide different information
granularity in the Gramian Angular Field for classification tasks, and the
Gramian Angular Summation Field (GASF) of $[0, 1]$ rescaled data has an
exact inverse map. This property lays the foundation for imputing missing
values of time series by recovering the images.

After transforming the rescaled time series into the polar coordinate
system, we can easily exploit the angular perspective by considering the
trigonometric sum/difference between each pair of points to identify the
temporal correlation within different time intervals. The Gramian Angular
Summation Field (GASF) and Gramian Angular Difference Field (GADF) are
defined as follows:

$$GASF = [\cos(\phi_i + \phi_j)] \tag{4}$$

$$= \tilde{X}' \cdot \tilde{X} - \sqrt{I - \tilde{X}^2}' \cdot \sqrt{I - \tilde{X}^2} \tag{5}$$

$$GADF = [\sin(\phi_i - \phi_j)] \tag{6}$$

$$= \sqrt{I - \tilde{X}^2}' \cdot \tilde{X} - \tilde{X}' \cdot \sqrt{I - \tilde{X}^2} \tag{7}$$

$I$ is the unit row vector $[1, 1, \ldots, 1]$. Equations (5) and (7)
follow from the angle sum and difference identities with
$\cos(\phi_i) = \tilde{x}_i$ and $\sin(\phi_i) = \sqrt{1 - \tilde{x}_i^2}$.
After transforming to the polar coordinate system, we take time series at
each time step as a 1-D metric space. By defining the inner product
$\langle x, y \rangle = x \cdot y - \sqrt{1 - x^2} \cdot \sqrt{1 - y^2}$
and $\langle x, y \rangle = \sqrt{1 - x^2} \cdot y - x \cdot \sqrt{1 - y^2}$,
the two types of Gramian Angular Fields (GAFs) are actually quasi-Gramian
matrices $[\langle \tilde{x}_i, \tilde{x}_j \rangle]$ (see footnote 1).

Footnote 1: 'quasi' because the functions $\langle x, y \rangle$ we
defined do not satisfy the property of linearity in inner-product space.

The GAFs have several advantages. First, they provide a way to preserve
temporal dependency, since time increases as the position moves from
top-left to bottom-right. The GAFs contain temporal correlations because
$G_{(i,j \mid |i-j| = k)}$ represents the relative correlation by
superposition/difference of directions with respect to time interval $k$.
The main diagonal $G_{i,i}$ is the special case when $k = 0$, which
contains the original value/angular information. From the main diagonal,
we can reconstruct the time series from the high level features learned by
the deep neural network. However, the GAFs are large because the size of
the Gramian matrix is $n \times n$ when the length of the raw time series
is $n$. To reduce the size of the GAFs, we apply Piecewise Aggregate
Approximation (PAA) [Keogh and Pazzani, 2000] to smooth the time series
while preserving the trends. The full pipeline for generating the GAFs is
illustrated in Figure 1.

Figure 2: Illustration of the proposed encoding map of Markov Transition
Fields. X is a sequence of time series in the 'ECG' dataset. X is first
discretized into Q quantile bins. Then we calculate its Markov Transition
Matrix W and finally build its MTF with eq. (8).
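
As an illustration, the encoding of this section can be sketched in NumPy. This is a minimal sketch and not the authors' implementation: the function names (`rescale`, `paa`, `gramian_angular_fields`) are hypothetical, the series is assumed to be non-constant and at least as long as the number of PAA bins, and applying PAA before rescaling is an assumption about the order of the two steps.

```python
import numpy as np

def rescale(x, symmetric=True):
    """Min-max rescale to [-1, 1] (eq. 1) or to [0, 1] (eq. 2).
    Assumes x is not constant, so max(x) > min(x)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if symmetric:
        return ((x - x_max) + (x - x_min)) / (x_max - x_min)
    return (x - x_min) / (x_max - x_min)

def paa(x, n_bins):
    """Piecewise Aggregate Approximation: mean of each of n_bins adjacent,
    non-overlapping windows. Assumes n_bins <= len(x)."""
    x = np.asarray(x, dtype=float)
    return np.array([chunk.mean() for chunk in np.array_split(x, n_bins)])

def gramian_angular_fields(x_tilde):
    """Return (GASF, GADF) for a rescaled series, following eqs. (3), (4), (6)."""
    x_tilde = np.clip(x_tilde, -1.0, 1.0)       # guard against rounding error
    phi = np.arccos(x_tilde)                    # angular encoding of eq. (3)
    gasf = np.cos(phi[:, None] + phi[None, :])  # eq. (4)
    gadf = np.sin(phi[:, None] - phi[None, :])  # eq. (6)
    return gasf, gadf

# Example: a 128-point sine wave reduced to a 16 x 16 image pair.
x = np.sin(np.linspace(0, 4 * np.pi, 128))
gasf, gadf = gramian_angular_fields(rescale(paa(x, 16)))
```

The GASF is symmetric and the GADF is antisymmetric with a zero main diagonal, which is why only the GASF diagonal carries the original values.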

2.2 Markov Transition Field 

We propose a framework similar to Campanharo et al. for encoding dynamical
transition statistics, but we extend that idea by representing the Markov
transition probabilities sequentially to preserve information in the time
domain.

Given a time series $X$, we identify its $Q$ quantile bins and assign each
$x_i$ to the corresponding bin $q_j$ ($j \in [1, Q]$). Thus we construct a
$Q \times Q$ weighted adjacency matrix $W$ by counting transitions among
quantile bins in the manner of a first-order Markov chain along the time
axis. $w_{i,j}$ is given by the frequency with which a point in quantile
$q_i$ is followed by a point in quantile $q_j$. After normalization by
$\sum_j w_{ij} = 1$, $W$ is the Markov transition matrix. It is
insensitive to the distribution of $X$ and to the temporal dependency on
time steps $t_i$. However, our experimental results on $W$ demonstrate
that getting rid of the temporal dependency results in too much
information loss in matrix $W$. To overcome this drawback, we define the
Markov Transition Field (MTF) as follows:

$$M = \begin{bmatrix}
w_{ij \mid x_1 \in q_i, x_1 \in q_j} & \cdots & w_{ij \mid x_1 \in q_i, x_n \in q_j} \\
w_{ij \mid x_2 \in q_i, x_1 \in q_j} & \cdots & w_{ij \mid x_2 \in q_i, x_n \in q_j} \\
\vdots & \ddots & \vdots \\
w_{ij \mid x_n \in q_i, x_1 \in q_j} & \cdots & w_{ij \mid x_n \in q_i, x_n \in q_j}
\end{bmatrix} \tag{8}$$

We build a $Q \times Q$ Markov transition matrix ($W$) by dividing the
data (magnitude) into $Q$ quantile bins. The quantile bins that contain
the data at time stamps $i$ and $j$ (temporal axis) are $q_i$ and $q_j$
($q \in [1, Q]$). $M_{ij}$ in the MTF denotes the transition probability
of $q_i \to q_j$. That is, we spread out matrix $W$, which contains the
transition probability on the magnitude axis, into the MTF matrix by
considering the temporal positions.

By assigning the probability from the quantile at time step $i$ to the
quantile at time step $j$ at each pixel $M_{ij}$, the MTF $M$ actually
encodes the multi-span transition probabilities of the time series.
$M_{i,j \mid |i-j| = k}$ denotes the transition probability between the
points with time interval $k$. For example, $M_{ij \mid j-i=1}$
illustrates the transition process along the time axis with a skip step.
The main diagonal $M_{ii}$, which is the special case when $k = 0$,
captures the probability from each quantile to itself (the self-transition
probability) at time step $i$. To make the image size manageable and
computation more efficient, we reduce the MTF size by averaging the pixels
in each non-overlapping $m \times m$ patch with the blurring kernel
$\{\frac{1}{m^2}\}_{m \times m}$. That is, we aggregate the transition
probabilities in each subsequence of length $m$ together. Figure 2 shows
the procedure to encode time series to MTF.

Figure 3: Structure of the tiled convolutional neural networks. We fix the
size of receptive fields to 8 x 8 in the first convolutional layer and
3 x 3 in the second convolutional layer. Each TICA pooling layer pools
over a block of 3 x 3 input units in the previous layer without wrapping
around the borders to optimize for sparsity of the pooling units. The
number of pooling units in each map is exactly the same as the number of
input units. The last layer is a linear SVM for classification. We
construct this network by stacking two Tiled CNNs, each with 6 maps
(l = 6) and tiling size k = 1, 2, 3.
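
The MTF construction can be sketched in NumPy as follows. This is a hedged sketch rather than the authors' code: the names `markov_transition_field` and `blur` are hypothetical, `W[i, j]` is taken to be the probability that a point in bin `i` is followed by a point in bin `j` (the row-normalized reading of $\sum_j w_{ij} = 1$), and dropping a ragged border when the image side is not a multiple of `m` is an assumption.

```python
import numpy as np

def markov_transition_field(x, n_bins):
    """Return (W, M): the Q x Q Markov transition matrix and the n x n MTF."""
    x = np.asarray(x, dtype=float)
    # Interior quantile edges, so each bin holds roughly the same number of points.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    q = np.searchsorted(edges, x, side="right")   # bin index in 0 .. n_bins - 1
    # Count first-order transitions q[t] -> q[t + 1] along the time axis.
    W = np.zeros((n_bins, n_bins))
    np.add.at(W, (q[:-1], q[1:]), 1.0)
    # Row-normalize; rows with no outgoing transition stay at zero.
    row_sums = W.sum(axis=1, keepdims=True)
    W = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    # Spread W over the temporal axes: M[i, j] = W[bin of x_i, bin of x_j].
    M = W[q[:, None], q[None, :]]
    return W, M

def blur(M, m):
    """Average each non-overlapping m x m patch (kernel with entries 1/m^2)."""
    n = M.shape[0] - M.shape[0] % m               # drop a ragged border, if any
    return M[:n, :n].reshape(n // m, m, n // m, m).mean(axis=(1, 3))
```

For a series of length 64 with `n_bins=4`, `markov_transition_field` returns a 4 x 4 matrix and a 64 x 64 field, and `blur(M, 8)` reduces the field to 8 x 8.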

3 Classify Time Series Using GAF/MTF with 
Tiled CNNs 

We apply Tiled CNNs to classify time series using GAF and MTF
representations on 20 datasets from [Keogh et al., 2011] in different
domains such as medicine, entomology, engineering, astronomy, signal
processing, and others. The datasets are pre-split into training and
testing sets to facilitate experimental comparisons. We compare the
classification error rate of our GASF-GADF-MTF approach with previously
published results of 3 classic baselines and 6 of the best recently
proposed approaches: early state-of-the-art 1NN classifiers based on
Euclidean distance and DTW (Best Warping Window and no Warping Window),
Fast-Shapelets [Rakthanmanon and Keogh, 2013], a 1NN classifier based on
SAX with Bag-of-Patterns (SAX-BoP) [Lin et al., 2012], a SAX based Vector
Space Model (SAX-VSM) [Senin and Malinchik, 2013], a classifier based on
the Recurrence Patterns Compression Distance (RPCD) [Silva et al., 2013],
a tree-based symbolic representation for multivariate time series (SMTS)
[Baydogan and Runger, 2014] and an SVM classifier based on a
bag-of-features representation (TSBF) [Baydogan et al., 2013].

Table 1: Summary of error rates for 3 classic baselines, 6 recently
published best results and our approach. The symbols <, *, † and •
represent datasets generated from human motions, figure shapes,
synthetically predefined procedures and all remaining temporal signals,
respectively. For our approach, the numbers in brackets are the optimal
PAA size and quantile size.

Dataset             1NN-RAW  1NN-DTW-BWW  1NN-DTW-nWW  Fast-Shapelet  SAX-BoP  SAX-VSM  RPCD    SMTS   TSBF   GASF-GADF-MTF
50words •           0.369    0.242        0.31         N/A            0.466    N/A      0.2264  0.289  0.209  0.301 (16, 32)
Adiac *             0.389    0.391        0.396        0.514          0.432    0.381    0.3836  0.248  0.245  0.373 (32, 48)
Beef •              0.467    0.467        0.5          0.447          0.433    0.33     0.3667  0.26   0.287  0.233 (64, 40)
CBF †               0.148    0.004        0.003        0.053          0.013    0.02     N/A     0.02   0.009  0.009 (32, 24)
Coffee •            0.25     0.179        0.179        0.068          0.036    0        0       0.029  0.004  0 (64, 48)
ECG •               0.12     0.12         0.23         0.227          0.15     0.14     0.14    0.159  0.145  0.09 (8, 32)
FaceAll *           0.286    0.192        0.192        0.411          0.219    0.207    0.1905  0.191  0.234  0.237 (8, 48)
FaceFour *          0.216    0.114        0.17         0.090          0.023    0        0.0568  0.165  0.051  0.068 (8, 16)
fish *              0.217    0.16         0.167        0.197          0.074    0.017    0.1257  0.147  0.08   0.114 (8, 40)
Gun_Point <         0.087    0.087        0.093        0.061          0.027    0.007    0       0.011  0.011  0.08 (32, 32)
Lighting2 •         0.246    0.131        0.131        0.295          0.164    0.196    0.2459  0.269  0.257  0.114 (16, 40)
Lighting7 •         0.425    0.288        0.274        0.403          0.466    0.301    0.3562  0.255  0.262  0.260 (16, 48)
OliveOil •          0.133    0.167        0.133        0.213          0.133    0.1      0.1667  0.177  0.09   0.2 (8, 48)
OSULeaf *           0.483    0.384        0.409        0.359          0.256    0.107    0.3554  0.377  0.329  0.358 (16, 32)
SwedishLeaf *       0.213    0.157        0.21         0.269          0.198    0.01     0.0976  0.08   0.075  0.065 (16, 48)
synthetic control † 0.12     0.017        0.007        0.081          0.037    0.251    N/A     0.025  0.008  0.007 (64, 48)
Trace †             0.24     0.01         0            0.002          0        0        N/A     0      0.02   0 (64, 48)
Two Patterns †      0.09     0.0015       0            0.113          0.129    0.004    N/A     0.003  0.001  0.091 (64, 32)
wafer •             0.005    0.005        0.02         0.004          0.003    0.0006   0.0034  0      0.004  0 (64, 16)
yoga *              0.17     0.155        0.164        0.249          0.17     0.164    0.134   0.094  0.149  0.196 (8, 32)
#wins               0        0            3            0              1        5        3       4      4      9

3.1 Tiled Convolutional Neural Networks 

Tiled Convolutional Neural Networks are a variation of Convolutional
Neural Networks that use tiles and multiple feature maps to learn
invariant features. Tiles are parameterized by a tile size k to control
the distance over which weights are shared. By producing multiple feature
maps, Tiled CNNs learn overcomplete representations through unsupervised
pretraining with Topographic ICA (TICA). For the sake of space, please
refer to [Ngiam et al., 2010] for more details. The structure of Tiled
CNNs applied in this paper is illustrated in Figure 3.

3.2 Experiment Setting 

In our experiments, the size of the GAF image is regulated by the number
of PAA bins $S_{GAF}$. Given a time series $X$ of size $n$, we divide the
time series into $S_{GAF}$ adjacent, non-overlapping windows along the
time axis and extract the mean of each bin. This enables us to construct
the smaller GAF matrix $G_{S_{GAF} \times S_{GAF}}$. MTF requires the time
series to be discretized into $Q$ quantile bins to calculate the
$Q \times Q$ Markov transition matrix, from which we construct the raw MTF
image $M_{n \times n}$ afterwards. Before classification, we shrink the
MTF image size to $S_{MTF} \times S_{MTF}$ by the blurring kernel
$\{\frac{1}{m^2}\}_{m \times m}$ where $m = \lceil \frac{n}{S_{MTF}} \rceil$.
The Tiled CNN is trained with image size
$\{S_{GAF}, S_{MTF}\} \in \{16, 24, 32, 40, 48\}$ and quantile size
$Q \in \{8, 16, 32, 64\}$. At the last layer of the Tiled CNN, we use a
linear soft margin SVM [Fan et al., 2008] and select $C$ by 5-fold cross
validation over $\{10^{-4}, 10^{-3}, \ldots, 10^{4}\}$ on the training
set.


For each input of image size $S_{GAF}$ or $S_{MTF}$ and quantile size $Q$,
we pretrain the Tiled CNN with the full unlabeled dataset (both training
and test set) to learn the initial weights $W$ through TICA. Then we train
the SVM at the last layer by selecting the penalty factor $C$ with cross
validation. Finally, we classify the test set using the optimal
hyperparameters $\{S, Q, C\}$ with the lowest error rate on the training
set. If two or more models tie, we prefer the larger $S$ and $Q$ because
larger $S$ helps preserve more information through the PAA procedure and
larger $Q$ encodes the dynamic transition statistics with more detail. Our
model selection approach provides generalization without being overly
expensive computationally.
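
The selection rule above can be sketched in plain Python. This is only an illustration under stated assumptions: `select_model` and the grid names are hypothetical, the error rates are assumed to have been computed elsewhere, and how ties in C are broken is not specified in the text, so the sketch leaves that to dictionary order.

```python
# Hyperparameter grids as listed in the text.
S_GRID = [16, 24, 32, 40, 48]              # image size S
Q_GRID = [8, 16, 32, 64]                   # quantile size Q
C_GRID = [10.0 ** e for e in range(-4, 5)] # SVM penalty factor C

def select_model(cv_errors):
    """Pick the (S, Q, C) key with the lowest error rate.

    cv_errors maps (S, Q, C) -> error rate on the training set.
    Among ties, prefer the larger S, then the larger Q.
    """
    return min(cv_errors, key=lambda k: (cv_errors[k], -k[0], -k[1]))

# Example with made-up error rates: three configurations tie at 0.10.
example = {
    (16, 8, 1.0): 0.10,
    (32, 16, 1.0): 0.10,
    (32, 8, 0.1): 0.10,
    (48, 64, 1.0): 0.20,
}
best = select_model(example)
```

In the example, the tie at 0.10 is resolved in favor of S = 32 and then Q = 16.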

3.3 Results and Discussion 

We use Tiled CNNs to classify the single GASF, GADF and MTF images as well
as the compound GASF-GADF-MTF images on 20 datasets. For the sake of
space, we do not show the full results on single-channel images.
Generally, our approach is not prone to overfitting, as indicated by the
relatively small difference between training and test set errors. One
exception is the Olive Oil dataset with the MTF approach, where the test
error is significantly higher.

In addition to the risk of potential overfitting, we found that MTF has
generally higher error rates than GAFs. This is most likely because of the
uncertainty in the inverse map of MTF. Note that the encoding functions
from [-1, 1] rescaled time series to GAFs and MTF are both surjections.
The map functions of GAFs and MTF will each produce only one image with
fixed $S$ and $Q$ for each given time series $X$. Because they are both
surjective mapping functions, the inverse image of both mapping functions
is not unique. However, the mapping functions of GAFs on [0, 1] rescaled
time series are bijective. As shown in a later section, we can reconstruct
the raw time series from the diagonal of GASF, but it is very hard to even
roughly recover the signal from MTF. Even for [-1, 1] rescaled data, the
GAFs have smaller uncertainty in the inverse image of their mapping
function because such randomness only comes from the ambiguity of
$\cos(\phi)$ when $\phi \in [0, 2\pi]$. MTF, on the other hand, has a much
larger inverse image space, which results in large variations when we try
to recover the signal. Although MTF encodes the transition dynamics, which
are important features of time series, such features alone seem not to be
sufficient for recognition/classification tasks.

Figure 4: Pipeline of time series imputation by image recovery. Raw GASF →
“broken” GASF → recovered GASF (top), Raw time series → corrupted time
series with missing value → predicted time series (bottom) on dataset
“SwedishLeaf” (left) and “ECG” (right).

Note that at each pixel, $G_{ij}$ denotes the superposition/difference of
the directions at $t_i$ and $t_j$, while $M_{ij}$ is the transition
probability from the quantile at $t_i$ to the quantile at $t_j$. GAF
encodes static information while MTF depicts information about dynamics.
From this point of view, we consider them as three “orthogonal” channels,
like different colors in the RGB image space. Thus, we can combine GAFs
and MTF images of the same size (i.e. $S_{GAF} = S_{MTF}$) to construct a
triple-channel image (GASF-GADF-MTF). It combines both the static and
dynamic statistics embedded in the raw time series, and we posit that it
will be able to enhance classification performance. In the experiments
below, we pretrain and tune the Tiled CNN on the compound GASF-GADF-MTF
images. Then, we report the classification error rate on test sets. In
Table 1, the Tiled CNN classifiers on GASF-GADF-MTF images achieve highly
competitive results against 9 other state-of-the-art time series
classification approaches.

4 Image Recovery on GASF for Time Series 
Imputation with Denoised Auto-encoder 

As previously mentioned, the mapping functions from [-1, 1] rescaled time
series to GAFs are surjections. The uncertainty among the inverse images
comes from the ambiguity of $\cos(\phi)$ when $\phi \in [0, 2\pi]$.
However, the mapping functions of [0, 1] rescaled time series are
bijections. The main diagonal of GASF, i.e.
$\{G_{ii}\} = \{\cos(2\phi_i)\}$, allows us to precisely reconstruct the
original time series by

$$\cos(\phi) = \sqrt{\frac{\cos(2\phi) + 1}{2}}, \quad \phi \in \left[0, \frac{\pi}{2}\right] \tag{9}$$

Thus, we can predict missing values among time series through recovering
the “broken” GASF images. During training, we manually add
“salt-and-pepper” noise (i.e., randomly set a number of points to 0) to
the raw time series and transform the data to GASF images. Then a single
layer Denoised Auto-encoder (DA) is fully trained as a generative model to
reconstruct GASF images. Note that at the input layer, we do not add noise
again to the “broken” GASF images. A sigmoid function helps to learn the
nonlinear features at the hidden layer. At the last layer we compute the
Mean Square Error (MSE) between the original GASF images and the
reconstructions of the “broken” ones as the loss function to evaluate
fitting performance. To train the models, simple batch gradient descent is
applied to back propagate the inference loss. For testing, after we
corrupt the time series and transform the noisy data to “broken” GASF, the
trained DA helps recover the image, from which we extract the main
diagonal to reconstruct the recovered time series. To compare the
imputation performance, we also test a standard DA with the raw time
series data as input to recover the missing values (Figure 4).
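
The inverse map of eq. (9) can be checked numerically with a short NumPy sketch. The function names `gasf` and `series_from_gasf` are hypothetical, and the input is assumed to be already rescaled to [0, 1]; the DA itself is not modeled here.

```python
import numpy as np

def gasf(x01):
    """GASF of a series already rescaled to [0, 1]."""
    phi = np.arccos(np.clip(x01, 0.0, 1.0))
    return np.cos(phi[:, None] + phi[None, :])

def series_from_gasf(G):
    """Recover the [0, 1] series from the main diagonal via eq. (9).

    The diagonal holds cos(2 * phi_i); since phi_i lies in [0, pi/2],
    cos(phi_i) is the non-negative root of (cos(2 * phi_i) + 1) / 2.
    """
    d = np.clip(np.diag(G), -1.0, 1.0)   # guard against rounding error
    return np.sqrt((d + 1.0) / 2.0)

# Round trip on a random [0, 1] series.
rng = np.random.default_rng(0)
x01 = rng.random(50)
recovered = series_from_gasf(gasf(x01))
```

The round trip returns the input up to floating-point error, which is the bijection property used for imputation. On [-1, 1] rescaled data the same formula would lose the sign of each value.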

4.1 Experiment Setting 

For the DA models we use batch gradient descent with a batch size of 20.
Optimization iterations run until the MSE changes by less than a threshold
of $10^{-3}$ for GASF and $10^{-5}$ for raw time series. A single hidden
layer has 500 hidden neurons with sigmoid functions. We choose four
datasets of different types from the UCR time series repository for the
imputation task: “Gun Point” (human motion), “CBF” (synthetic data),
“SwedishLeaf” (figure shapes) and “ECG” (other remaining temporal
signals). To explore if the statistical dependency learned by the DA can
be generalized to unknown data, we use the above four datasets and the
“Adiac” dataset together to train the DA to impute two totally unknown
test datasets, “Two Patterns” and “wafer” (we name these synthetic
miscellaneous datasets “7 Misc”). To add randomness to the input of the
DA, we randomly set 20% of the raw data among a specific time series to be
zero (salt-and-pepper noise). Our experiments for imputation are
implemented with Theano [Bastien et al., 2012]. To control for the random
initialization of the parameters and the randomness induced by gradient
descent, we repeated every experiment 10 times and report the average MSE.
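
The corruption step can be sketched as follows. This is a minimal NumPy illustration, not the Theano code used in the experiments: the function name `salt_and_pepper` and the returned boolean mask are assumptions, added here so that the zeroed positions can be scored separately later.

```python
import numpy as np

def salt_and_pepper(x, fraction=0.2, rng=None):
    """Set a random `fraction` of the points of one time series to zero.

    Returns the corrupted copy and a boolean mask marking the zeroed points.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n_missing = int(round(fraction * x.size))
    idx = rng.choice(x.size, size=n_missing, replace=False)
    mask = np.zeros(x.size, dtype=bool)
    mask[idx] = True
    corrupted = x.copy()
    corrupted[mask] = 0.0
    return corrupted, mask

# Example: corrupt 20% of a 100-point series with a fixed seed.
x = np.linspace(0.1, 1.0, 100)
corrupted, mask = salt_and_pepper(x, fraction=0.2, rng=np.random.default_rng(0))
```

The corrupted series would then be transformed to a “broken” GASF image before being passed to the DA.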

4.2 Results and Discussion 


Table 2: MSE of imputation on time series using raw data and GASF images.

Dataset        Full MSE              Imputation MSE
               Raw       GASF        Raw       GASF
ECG            0.01001   0.01148     0.02301   0.01196
CBF            0.02009   0.03520     0.04116   0.03119
Gun Point      0.00693   0.00894     0.01069   0.00841
SwedishLeaf    0.00606   0.00889     0.01117   0.00981
7 Misc         0.06134   0.10130     0.10998   0.07077


In Table 2, “Full MSE” means the MSE between the complete recovered and
original sequence and “Imputation MSE” means the MSE of only the unknown
points among each time series. Interestingly, DA on the raw data generally
performs well on the whole sequence, but there is a gap between the full
MSE and imputation MSE. That is, DA on raw time series can fit the known
data much better than it predicts the unknown data (like overfitting).
Predicting the missing value using GASF always achieves slightly higher
full MSE but the imputation MSE is reduced by 12.18%-48.02%. We can
observe that the difference between the full MSE and imputation MSE is
much smaller on GASF than on the raw data. Imputation with GASF has more
stable performance than on the raw data.

Figure 5: (a) Original GASF and its six learned feature maps before the
SVM layer in Tiled CNNs (left), (b) Raw time series and its
reconstructions from the main diagonal of six feature maps on '50Words'
dataset (right).
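
The two error measures and the reported reduction range can be reproduced with a short NumPy sketch. The function name `full_and_imputation_mse` and the boolean `mask` argument are hypothetical; the numbers in `TABLE2_IMPUTATION_MSE` are copied from the imputation columns of Table 2.

```python
import numpy as np

def full_and_imputation_mse(original, recovered, mask):
    """Full MSE over the whole sequence and MSE over the masked points only."""
    err = (np.asarray(original, dtype=float) - np.asarray(recovered, dtype=float)) ** 2
    return err.mean(), err[np.asarray(mask, dtype=bool)].mean()

# Imputation MSE from Table 2 as (raw, GASF) pairs.
TABLE2_IMPUTATION_MSE = {
    "ECG": (0.02301, 0.01196),
    "CBF": (0.04116, 0.03119),
    "Gun Point": (0.01069, 0.00841),
    "SwedishLeaf": (0.01117, 0.00981),
    "7 Misc": (0.10998, 0.07077),
}

# Relative reduction (in percent) of the imputation MSE when using GASF.
reductions = {
    name: 100.0 * (raw - gasf) / raw
    for name, (raw, gasf) in TABLE2_IMPUTATION_MSE.items()
}
```

The smallest reduction is on SwedishLeaf and the largest on ECG, which matches the 12.18%-48.02% range quoted in the text.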

Why does predicting missing values using GASF have more stable performance
than using raw time series? The transformation maps of GAFs are generally
equivalent to a kernel trick. By defining the inner product
$k(x_i, x_j)$, we achieve data augmentation by increasing the
dimensionality of the raw data. By preserving the temporal and spatial
information in GASF images, the DA utilizes both temporal and spatial
dependencies by considering the missing points as well as their relations
to other data that have been explicitly encoded in the GASF images.
Because the entire sequence, instead of a short subsequence, helps predict
the missing value, the performance is more stable, as the full MSE and
imputation MSE are close.

5 Analysis on Features and Weights Learned 
by Tiled CNNs and DA 

In contrast to the cases in which CNNs are applied to natural image
recognition tasks, neither GAFs nor MTF have natural interpretations of
visual concepts like “edges” or “angles”. In this section we analyze the
features and weights learned through Tiled CNNs to explain why our
approach works.

Figure 5 illustrates the reconstruction results from six feature maps
learned through the Tiled CNNs on GASF (by eq. (9)). The Tiled CNNs
extract the color patch, which is essentially a moving average that
enhances several receptive fields within the nonlinear units by different
trained weights. It is not a simple moving average but a synthetic
integration that considers the 2D temporal dependencies among different
time intervals, which is a benefit of the Gramian matrix structure that
helps preserve the temporal information. By observing the orthogonal
reconstruction from each layer of the feature maps, we can clearly observe
that the tiled CNNs can extract the multi-frequency dependencies through
the convolution and pooling architecture on the GAF and MTF images to
preserve the trend while addressing more details in different subphases.
The high-level feature maps learned by the Tiled CNN are equivalent to a
multi-frequency approximator of the original curve. Our experiments also
demonstrate the learned weight matrix $W$ with the constraint
$WW^T = I$, which makes effective use of local orthogonality. The TICA
pretraining provides the built-in advantage that the function w.r.t. the
parameter space is not likely to be ill-conditioned as $WW^T = I$. The
weight matrix $W$ is quasi-orthogonal, with entries close to 0 and no
large magnitudes. This implies that the condition number of $W$ approaches
1 and helps the system to be well-conditioned.

Figure 6: All 500 filters learned by DA on the “Gun Point” (left) and
“7 Misc” (right) dataset.
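
The link between orthogonality and conditioning can be illustrated numerically. The sketch below only demonstrates the linear-algebra fact that a matrix with $WW^T = I$ has all singular values equal to 1; the matrix is a random orthogonal matrix obtained by QR decomposition, which is an assumption for illustration and not a weight matrix learned by TICA.

```python
import numpy as np

# Build a random orthogonal matrix (a stand-in for W, not a trained weight matrix).
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((8, 8)))

# Orthogonality: W W^T equals the identity up to rounding error.
gram = W @ W.T

# 2-norm condition number: ratio of the largest to the smallest singular value.
cond = np.linalg.cond(W)
```

Here `cond` equals 1 up to rounding, the best possible value. A quasi-orthogonal matrix would give a value close to, but not exactly, 1.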

As for imputation, because the GASF images have no concept of “angle” and
“edge”, DA actually learned different prototypes of the GASF images
(Figure 6). We find that there is significant noise in the filters on the
“7 Misc” dataset because the training set is too small to learn distinct
filters well. In effect, all the noisy filters with no patterns work like
one Gaussian noise filter.

6 Conclusions and Future Work 

We created a pipeline for converting time series into novel
representations, GASF, GADF and MTF images, and extracted multi-level
features from these using Tiled CNN and DA for classification and
imputation. We demonstrated that our approach yields competitive results
for classification when compared to recent best methods. Imputation using
GASF achieved better and more stable performance than on the raw data
using DA. Our analysis of the features learned from Tiled CNN suggested
that Tiled CNN works like a multi-frequency moving average that benefits
from the 2D temporal dependency that is preserved by the Gramian matrix.
Features learned by DA on GASF are shown to be different prototypes that
serve as a correlated basis for constructing the raw images.

Important future work will involve developing recurrent neural nets to
process streaming data. We are also quite interested in how different deep
learning architectures perform on the GAF and MTF images. Another
important direction is to learn deep generative models with more
high-level features on GAF images. We aim to further apply our time series
models in real world regression/imputation and anomaly detection tasks.